Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Call rebalance API once all pods are updated #625

Merged

Conversation

radu-gheorghe
Copy link
Contributor

Timid attempt to fix #615

It works on my machine, it seems to fix the problem, but:

  • I have no tests. Yet. I have a feeling this is the hard part 🙂
  • I guess I'll need to update managed-updates.md a bit as well.
  • There are a couple of TODOes in my patch where I'm not sure if things are right - any feedback there would be great!
  • Last but not least, with or without my patch, I always got a DOWN shard at the end of my rolling update. That's a separate issue, right? Still, it's always reproducible in my environment, maybe due to the fact that I'm testing with some docs in my collection and I'm having my IDE (with Solr Operator) talk to Kubernetes via a port forward to the "common" service, which sometimes fails because the underlying pod gets restarted, so I get a few "connection refused" errors until I put it back. Still, it should be something that the Operator handles gracefully, IMO.

Any feedback is welcome, I'd be glad to push this forward!

Copy link
Contributor

@HoustonPutman HoustonPutman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great! I have two edge cases that I think we need to account for:

  1. In the case that all the pods are updates, the async rebalance command will be started. The user can then change the spec again, and the updatedPodCount will then again be 0, because all pods will be out of date. At that point we will have forgotten about the rebalance command that we started. We need to make sure that we complete any async command before we start a new one, so I suggest that we split off the async command to a separate ClusterOp, that we start after the restart is complete. That way another restart will only begin once the rebalance is finished.
  2. If I'm correct, we are rebalancing even if replicas weren't moved around during the rolling restart. In my opinion, we should only rebalance if replicas were moved around.

Copy link
Contributor

@HoustonPutman HoustonPutman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is good to go @radu-gheorghe , give it a look!

@radu-gheorghe
Copy link
Contributor Author

Looks good to me, too, thanks a lot @HoustonPutman!

I already had a look yesterday, too, and noticed you couldn't help yourself and fixed your 2) above 🙂

@HoustonPutman
Copy link
Contributor

I already had a look yesterday, too, and noticed you couldn't help yourself and fixed your 2) above 🙂

Haha I remembered that we had the lovely readinessConditions that told us if the cloud was ephemeral, so it was very easy to do it!

@HoustonPutman HoustonPutman merged commit 0ce9517 into apache:main Oct 10, 2023
3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Rebalance ephemeral Solr cloud after a rolling restart
2 participants